Wave 2 polish: fix shader build, remove hardcoded paths, add CI by Peterc3-dev · Pull Request #1 · Peterc3-dev/torch-vulkan

Peterc3-dev · 2026-05-29T17:16:36Z

Summary

Polish pass on the Vulkan-compute PyTorch backend. Focus: a real reproducible-build bug in the shader toolchain, removing hardcoded/leaked developer paths, and adding CI for what can be checked without a GPU.

Changes

Shader build (real bug, fixed + verified)

csrc/shaders/compile.sh used glslangValidator -V, which emits SPIR-V 1.0. The 12 subgroup-using quantized matmul kernels (matmul_q4k/q5k/q6k, their _batch variants, and matmul_gpuq4/5/6) use GL_KHR_shader_subgroup reductions that require SPIR-V >= 1.3, so the script could not rebuild them. argmax also uses subgroup ops but was not among the 12 failures. Switched to --target-env vulkan1.1 (SPIR-V 1.3). Added set -euo pipefail and a non-zero exit if any shader fails.
Verified locally with the system glslangValidator: all 38 shaders compile (12 of which failed before).
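
The compile-and-fail-loudly behaviour described above can be sketched in Python. This is an illustration only, not the contents of compile.sh: the names `build_command` and `compile_all`, the injectable `runner` parameter, and the `.comp` to `.spv` output naming are assumptions. Only the `glslangValidator -V --target-env vulkan1.1` invocation comes from this PR.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, out: Path) -> list[str]:
    """Return the glslangValidator invocation for one compute shader."""
    # Plain -V targets Vulkan 1.0 (SPIR-V 1.0); vulkan1.1 raises the
    # target to SPIR-V 1.3, which the subgroup reductions need.
    return ["glslangValidator", "-V", "--target-env", "vulkan1.1",
            str(src), "-o", str(out)]


def compile_all(shader_dir: Path, runner=subprocess.run) -> list[Path]:
    """Compile every *.comp in shader_dir; return the sources that failed."""
    failed = []
    for src in sorted(shader_dir.glob("*.comp")):
        out = src.with_suffix(".spv")  # assumed output naming
        result = runner(build_command(src, out))
        if result.returncode != 0:
            failed.append(src)
    return failed


def main(shader_dir: str) -> int:
    """Return a process exit code: 1 if any shader failed, else 0."""
    failures = compile_all(Path(shader_dir))
    for src in failures:
        print(f"FAILED: {src}")
    return 1 if failures else 0
```

Collecting failures instead of stopping at the first one reports every broken shader in a single run while still producing a non-zero exit code for CI.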

Hardcoded / leaked paths removed (portability)

csrc/vulkan_engine.cpp hardcoded /home/raz/projects/torch-vulkan/csrc/shaders/. Removed it; the engine now falls back to the TORCH_VULKAN_SHADER_DIR env var.
torch_vulkan/__init__.py now exports the resolved bundled-shader directory into TORCH_VULKAN_SHADER_DIR so the lazily-constructed VulkanEngine finds the same shaders the Kompute context uses.
tests/test_algo_cache.py and tests/bench_layer.py hardcoded /home/raz/builds/pytorch-gfx1150 on sys.path. Replaced with an optional TORCH_VULKAN_PYTORCH_PATH env var.
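
A minimal sketch of the two environment-variable patterns described in this list. The helper names `export_shader_dir` and `add_optional_pytorch_path` and the `shaders` subdirectory are hypothetical; the variable names and the use of `os.environ.setdefault` are taken from this PR and its review.

```python
import os
import sys
from pathlib import Path


def export_shader_dir(package_dir: Path) -> str:
    """Publish the bundled shader directory unless the user already set one."""
    bundled = package_dir / "shaders"  # hypothetical package layout
    # setdefault leaves a user-provided TORCH_VULKAN_SHADER_DIR untouched.
    return os.environ.setdefault("TORCH_VULKAN_SHADER_DIR", str(bundled))


def add_optional_pytorch_path() -> bool:
    """Prepend a custom PyTorch build to sys.path only when the var is set."""
    custom = os.environ.get("TORCH_VULKAN_PYTORCH_PATH")
    if not custom:
        return False  # nothing configured: use whatever torch is installed
    if custom not in sys.path:
        sys.path.insert(0, custom)
    return True
```

With this shape, a checkout on another machine works without edits: the tests pick up a custom build only if `TORCH_VULKAN_PYTORCH_PATH` is exported, and a user override of the shader directory always wins over the bundled default.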

Tests

tests/test_mm.py: the old test_cpu_fallback claimed "relu isn't implemented" — but relu is wired to the Vulkan backend. Split into test_relu (real correctness check vs CPU) and test_unimplemented_op_falls_back_to_cpu (uses sign, which has no Vulkan impl, to actually exercise the boxed CPU fallback).

Lint / cleanup

Removed dead imports (numpy, time in persistent_pipeline.py; sys in setup.py) and unused benchmark locals.
Aligned the __init__.py module docstring with the README: .to("vulkan") is the supported tensor-creation path; torch.randn(..., device="vulkan") and .vulkan() are only partially wired.
README quickstart now compiles shaders and uses cmake -S . -B build (the old cd build referenced a gitignored, non-existent dir).

CI (new)

.github/workflows/ci.yml: compiles every shader with --target-env vulkan1.1 (guards the regression above) and runs ruff + py_compile.
pyproject.toml: ruff config (E,F,I; E402 ignored because the package intentionally imports torch, then loads _C, then aliases into torch.vulkan).
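
The py_compile half of the CI check can be approximated with the sketch below. It is not the workflow's actual step: the function name `check_syntax` and the recursive walk over the tree are assumptions, and the workflow may invoke py_compile differently.

```python
import py_compile
from pathlib import Path


def check_syntax(root: Path) -> list[Path]:
    """Byte-compile every .py file under root; return the ones that fail."""
    broken = []
    for path in sorted(root.rglob("*.py")):
        try:
            # doraise=True raises PyCompileError instead of printing the error.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError:
            broken.append(path)
    return broken
```

This only proves that each file parses. It needs no GPU, no Kompute, and no custom PyTorch build, which is why it fits on hosted runners.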

Verified

Shader compilation: all 38 .comp compile with the new flag (ran locally).
Ruff: clean (ruff 0.15.15, ran locally on the full tree).
py_compile: passes for all Python files.
Committed .spv binaries were intentionally left untouched — recompiling them locally produced different bytes (different glslang build) that I can't validate against the target GPU, and the committed ones are the verified artifacts per the README.

UNVERIFIED (no toolchain on this machine)

C++ extension build NOT run. cmake is not installed, the Kompute dependency is absent, and there are no Vulkan dev headers or GPU here. The C++ edits are minimal and mechanical (one #include <cstdlib>, a std::getenv fallback block, a comment-only change in torch_vulkan.cpp) but were not compiled.
Runtime test suite (pytest tests/) NOT run — requires a Vulkan GPU + Kompute + the custom PyTorch build. No correctness/perf numbers were re-measured; none are claimed here.

TODOs left (not addressed — intent unclear / out of scope)

The many "written but not wired" shaders (Q5_K/Q6_K dispatch, rope, rmsnorm, silu_gate, kv-cache attention, etc.) remain roadmap items with no host dispatch.
mm_raw (raw VulkanEngine path) is implemented but still not registered to any aten op.

🤖 Generated with Claude Code

- compile.sh: target Vulkan 1.1 (SPIR-V 1.3) so the 12 subgroup-using quantized kernels (matmul_q*k*, matmul_gpuq*) actually compile; the default `glslangValidator -V` emits SPIR-V 1.0 and failed on them. Added set -euo pipefail and a non-zero exit on any failure.
- vulkan_engine.cpp: drop the hardcoded /home/raz/... shader path (leaked a developer path and broke on every other machine). Fall back to the TORCH_VULKAN_SHADER_DIR env var instead.
- __init__.py: export the resolved bundled-shader dir into TORCH_VULKAN_SHADER_DIR so the lazily-constructed VulkanEngine resolves the same shaders. Aligned the module docstring with the README (.to("vulkan") is the supported path; torch.randn(device=) / .vulkan() are partial).
- tests: removed hardcoded /home/raz/builds/pytorch-gfx1150 sys.path inserts; honour TORCH_VULKAN_PYTORCH_PATH instead. Fixed the misleading "relu isn't implemented" fallback test (relu IS wired) and split it into a real relu correctness test plus a genuine CPU-fallback test using an unimplemented op.
- Removed dead imports (numpy, time, sys) and unused locals flagged by ruff.
- Added pyproject.toml (ruff config) and .github/workflows/ci.yml that compiles all shaders and runs ruff + py_compile. The GPU build/test suite is not run in CI (no GPU on hosted runners).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Peterc3-dev · 2026-05-29T17:19:05Z

Independent verification — verdict: solid (mergeable)

Re-checked every concrete claim on a separate machine (glslangValidator + ruff present; cmake/Vulkan GPU absent, so C++ build/pytest remain unverified — matching the PR's own UNVERIFIED section).

Shader fix — confirmed real and correct

Old glslangValidator -V (SPIR-V 1.0) fails on exactly the 12 quantized matmul kernels with 'subgroup op' : requires SPIR-V 1.3. argmax.comp uses subgroup ops too but compiles under the old flag — the PR correctly claimed only 12 failed.
New --target-env vulkan1.1 compiles all 38 .comp cleanly.
set -euo pipefail + the if ! glslang... guard: verified the script now exits 1 on a deliberately-broken shader (glslangValidator's own non-zero exit propagates). The old script silently swallowed failures.
Committed .spv left untouched — re-ran compile, then git checkout restored a clean tree as described.

Paths — confirmed removed

No /home/raz strings remain anywhere in shipped code (csrc/, torch_vulkan/, setup.py, tests/). Env-var fallbacks are correct.
VulkanEngine shaderDir_ is now empty if TORCH_VULKAN_SHADER_DIR is unset, but that engine is only reachable via mm_raw, which is not registered to any aten op, and __init__.py always sets the env var via os.environ.setdefault at import. No runtime regression for the live op path (which uses VulkanContext, untouched).

Tests — correct

relu IS registered (m.impl("relu", &relu) + relu.spv), so the old test_cpu_fallback was mislabeled. sign is registered nowhere, and the boxed CPU fallback IS registered (m.fallback(...)), so test_unimplemented_op_falls_back_to_cpu is a valid fallback exercise.
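
The reasoning above can be shown with a toy model in plain Python. This is not PyTorch's dispatcher and uses none of its API: `ToyDispatcher`, `CPU_OPS`, and the scalar ops are hypothetical stand-ins that only mirror the logic "registered ops run on the backend, everything else goes through the catch-all fallback".

```python
from typing import Callable, Dict, Optional


class ToyDispatcher:
    """Toy stand-in for per-op backend registration plus a catch-all fallback."""

    def __init__(self) -> None:
        self.impls: Dict[str, Callable] = {}
        self.fallback: Optional[Callable] = None

    def impl(self, name: str, fn: Callable) -> None:
        self.impls[name] = fn  # analogous in spirit to m.impl("relu", &relu)

    def set_fallback(self, fn: Callable) -> None:
        self.fallback = fn  # analogous in spirit to m.fallback(...)

    def call(self, name: str, x: float):
        if name in self.impls:
            return "backend", self.impls[name](x)
        if self.fallback is None:
            raise NotImplementedError(name)
        return "fallback", self.fallback(name, x)


# Reference "CPU" implementations that the fallback delegates to.
CPU_OPS = {
    "relu": lambda x: max(x, 0.0),
    "sign": lambda x: (x > 0) - (x < 0),
}

dispatcher = ToyDispatcher()
dispatcher.impl("relu", lambda x: max(x, 0.0))  # relu has a backend kernel
dispatcher.set_fallback(lambda name, x: CPU_OPS[name](x))  # sign does not
```

In this model, calling `relu` never reaches the fallback, so a test built on it exercises nothing about fallback behaviour. Only an op with no backend registration, such as `sign`, takes the fallback branch.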

Lint/CI — confirmed

ruff 0.15.15 clean; py_compile passes. Removed imports (time, numpy, sys) were genuine F401s on master. CI workflow guards the shader regression and scopes out the GPU build honestly.

Overclaims: none. The body is unusually disciplined about the build/pytest gap — independently confirmed cmake is unavailable here, so those remain legitimately unverified rather than overclaimed.

Recommend merge.

Peterc3-dev merged commit a344d77 into master May 30, 2026
2 checks passed

Peterc3-dev deleted the wave2-polish branch May 30, 2026 01:17

Peterc3-dev merged 1 commit into master from wave2-polish
